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ABSTRACT 

The original formulation of BEAMS - Bayesian Estimation Applied to Multiple 
Species - showed how to use a dataset contaminated by points of multiple underlying 
types to perform unbiased parameter estimation. An example is cosmological parame- 
ter estimation from a photometric supernova sample contaminated by unknown Type 
Ibc and II supernovae. Where other methods require data cuts to increase purity, 
BEAMS uses all of the data points in conjunction with their probabilities of being 
each type. Here we extend the BEAMS formalism to allow for correlations between 
the data and the type probabilities of the objects as can occur in realistic cases. We 
show with simple simulations that this extension can be crucial, providing a 50% re- 
duction in parameter estimation variance when such correlations do exist. We then go 
on to perform tests to quantify the importance of the type probabilities, one of which 
illustrates the effect of biasing the probabilities in various ways. Finally, a general pre- 
sentation of the selection bias problem is given, and discussed in the context of future 
photometric supernova surveys and BEAMS, which lead to specific recommendations 
for future supernova surveys. 

Key words: BEAMS, supernova, classification, typing, machine learning, selection 
bias, biased probabilities, Bayesian 



1 INTRODUCTION 

Type la Supernovae (SNela) provided the first widely ac- 
cepted evidence for cosmic acceleration in the late 1990s 
URiess et alT]p98l [Perlmutter et al.|p99| . While they 
were based on relatively small numbers of spectroscopically- 
confirmed SNela, those results have since been confirmed 
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Next generation SN surveys such as LSST will be funda- 
mentally different, yielding thousands of high-quality candi- 
dates every night for which spectroscopic confirmation will 
probably be impossible. Creating optimal ways of using this 
excellent photometric data is a key challenge in SN cosmol- 
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ogy for the coming decade. There are two ways that one can 
imagine using photometric candidates. The first approach is 



to try to classify the candidates into la, Ibc or II SNe ( John- 
son fc Crotts|2006||Kuznetsova fc Connolly|2007l [Poznanski 



et al.|2007 Rodney fc Tonry|2009 1 and then use only those 
objects that are believed to be SNela above some threshold 
of confidence. This has recently been discussed by |Sako et al.| 
( 2011 1 who showed that photometric cuts could achieve high 
purity. Nevertheless it is clear that this approach can still 
lead to biases and systematic errors from the small contam- 
inating group when used in conjunction with the simplest 
parameter estimation approaches such as the maximum like- 
lihood method. 

A second approach is to use all the SNe, irrespective 
of how likely they are to actually be a SNIa. This is the 
approach exemplified by the BEAMS formalism, which ac- 
counts for the contamination from non-la SN data using the 
appropriate Bayesian framework, as presented in |Kunz et al.| 
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(20071, hereafter referred to as KBH. In KBH, two threads 



are woven: a general statistical framework, and a discussion 
of how it may be applied to SNela. As noted in KBH, the 
general framework can be applied to any parameter estima- 
tion problem involving several populations, and indeed may 
have already been done so in other fields. In this paper we 
take the same approach as in KBH of keeping the notation 
general enough for application to other problems, while dis- 
cussing its relevance to SNe. 

We will attempt to use the same notation as in KBH, 
but differ where we consider it necessary. For example, we 
write conditional probability functions as /e|D(6|d). The 
quantity f0^]j{9\d)A6, should be interpreted as the prob- 
ability that O lies in the interval {9, 9 + A9), conditional on 
D = d (for small A9). 

We preserve capital letters for random variables and 
lowercase letters for their observed values. In the BEAMS 
framework, one wishes to estimate parameter(s) O from 
observations of the random variable X. We will use the 
boldface X to denote a vector of N such random variables: 
X = Xi...M- An observation of X we will denote by x, so 
that the full set of N observations is denoted by cc = x\...n- 
For SNe, the observations x axe the photometric data of the 
N SNe. As such, for SNe the probability density function 
(pdf) fx\B{x\9) is the likelihood of observing the photo- 
metric data X assuming some cosmological parameters 9, 
which we will discuss. The relationship between raw photo- 
metric data (A) and the true cosmological parameters (0) is 
highly intricate, resulting in a pdf which cannot realistically 
be worked with, and so one first reduces each observation x 
to a single feature d for which there is a direct 0-dependent 
model. For SNe, if the parameters Q are for example f^A and 
Q.m, then d will consist of an estimated luminosity distance 
and redshift. If the parameter of interest is a luminosity 
distance at a given redshift, then d will be simply a fitted 
distance modulus. Unless stated otherwise, this is the case. 

The correct treatment of redshifts will be important to 
BEAMS as applied to future SN surveys. Future surveys 
will likely have only photometric information for the SNe 
but will have a spectroscopic redshift for the host galaxy 
obtained by chance (because of overlap with existing sur- 
veys) or through a targeted follow-up program. The SDSS- 



II supernova survey ( Abazajian et al. 2009 1 is an example 
of both of these. There were host redshifts available from 
the main SDSS galaxy sample and there was also a targeted 
host followup program as part of the BOSS survey. Future 
large galaxy surveys like SKA, EUCLID or BigBOSS will 
likely provide a very large number of host galaxy redshifts 
for free. 

BEAMS is unique in that the underlying types of the 
observations are not assumed known. In the case where there 
are two underlying types (T £ {^4, -B}), each observation has 
an associated type probability (P) of being type A, 

P = V[T^A\Xp), 

where Xp is a subset of features of X. In other words, Xp 
is the component of the raw data X on which type proba- 
bilities are conditional. Note that we treat P as a random 
variable: while the value of P is completely determined by 
Xp, which in turn is completely determined by X, X is a 
random variable and therefore so too is P. The realizations 
of the type probabilities P of the A'' observations are denoted 



by p = pi-.-N, and we will call them TA-probabilities. The 
TA-probability for a SN is thus the probability of being type 
la, conditional on knowing the subset a;p of the the pho- 
tometric data. Xp may be the full photometric time-series, 
the earliest segment of the SN's light curve, a fitted shape 
parameter, or any other extracted photometric information. 

Finally, we mention that the type of the SN (T) is a 
random variable with realisation denoted r. A summary of 
all the variables used in the paper is given in Table [l] 

Attempts to approximate r/i-probabilities include those 



of [Poznanski et al.| ([2002|; [Newling et al.| ( |2011 ); [Richards 
et al. (20111 and as implemented in SALT2 (Guy et al. 



20071. Note that values obtained using these methods are 



only approximations of TA-probabilities, as the algorithms 
are trained on only a handful of spectroscopically confirmed 
SNe. Note too that there is no sense in which one set of TA- 
probabilities is the correct set, as this depends on what Xp 
is. Obtaining unbiased estimates of TA-probabilities is not 
easy, and we will consider the problems faced in doing so 
in Section [T] For SNe, the problem is made especially diffi- 
cult by the fact that spectroscopically confirmed SNe, which 
are used to train TA-probability estimating algorithms, are 
brighter than unconfirmed photometric SNe. 

In 2009 the Supernova Photometric Classification Chal- 
lenge (SNPCC) was run to encourage work on SN classifica- 
tion by lightcurves alone ( Kessler et al.|2010 l. Performance 
of the classification algorithms was judged according to the 
final purity and efficiency of extracted la samples. While the 
processing of photometric data is essential to the workings of 
BEAMS for SNe, the classification of objects is not required. 
It would be interesting to hold another competition where 
entrants are required to calculate TA-probabilities for SNe. 
Algorithms would then not only need to recognise SNela, 
but would also need to provide precise, unbiased probabili- 
ties of the object being an SNela. 

In brief, this paper consists of three more or less in- 
dependent parts. In Section [2] we present an extension of 
BEAMS to the case where certain correlations, which were 
ignored in KBH, are present. In Section[3j we discuss the rel- 
evance of TA-probabilities in a broader context, and specif- 
ically the importance to of them in BEAMS. Then is Sec- 
tions [4] [5| and [6] we perform simulations to better under- 
stand the importance of sample sizes, nearness of population 
distributions, biases of TA-probabilities and decisivenesses 
of TA-probabilities (to be defined). Finally, in Section [7] we 
present new ideas from the machine learning literature de- 
scribing when and how TA-probability biases emerge and 
how to correct for them. This is then discussed in the con- 
text of the SNPCC in Section [8] 



2 INTRODUCING AND MODIFYING THE 
BEAMS EQUATIONS 

The posterior probability on the parameter(s) 0, given the 
data D, is derived in Section II of KBH as 



fe\n{e\d) (X fe{9) x (1) 
E fD\e.Tid\9,T) l[p^l[ (l-Pj), 



-re[A,B]- 
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Random Variables 


R.V. 


Data 


Definition 


P 


P 


The probability of being type A conditional 
on Xp. We call P the r^i-probability. 


D 


d 


A particular feature of an object whose distri- 
bution depends directly on the parameter (s) 
we wish to approximate using BEAMS. SNe: 
D is luminosity distance. 


T 


T 


The type of an object, T G {A, B} SNe: T (E 
{la, nia} 


X 


X 


All the features observed of an object. SNe: 
X is the photometric data. 


Xp 


Xp 


That part of the features which affects con- 
firmation probability. SNe: Xp are peak ap- 
parent magnitudes. 


Xp 


Xp 


That part of the features used to determine 
the T^-probability. SNe: Xp could be any 
reduction of X. 


F 


f 


Whether the object is confirmed or not. For 
SNe: _F = 1 if a spectroscopic confirmation is 
performed. 


P 


P 


Is exactly P if the object is unconfirmed and 
1 or if confirmed, depending on type. 



— >■ The term in the numerator can then be written as the 
sum over all 2^ possible type vectors, 



\Tid,d,p,T) 



— >■ The numerator is again modified using the definition of 
conditional probability, 



E 



fn\e,p,T{d\e,p,T)f&^P,T{0,P,T) 
fn,p{d,p) 



— > We will now assume that the probability of having ta- 
probabilities and types p and r respectively are independent 
of 0. As noted following eqn.(4) in KBH, for SNe this as- 
sumption rests on the fact that Q (that is Qm, ^a) describes 
large scale evolution, while the SN types r depend on local 
gastrophysics, with little or no dependence on perturbations 
in dark matter. 



E 



^{d\e,p,T)fe{e)fp,T{p,T) 



fD,p{d,p) 



Table 1. A description of all the random variables used in this 
paper. 



where the PiS are TA-probabilities. The summation is over all 
of the 2^ possible ways that the A'^ observations can be clas- 
sified into two classes. We will refer to the expression on the 
right of ([T]) as the KBH posterior. When the N observations 
are assumed to be independent, that is when 



i = l 

the KBH posterior reduces, 



Di\0,T, 



{d^\6,n), 



n [/o.ie.T. K|0,A)p. + /o.|e,T.(d,:ie,B)(l-pO] • (2) 

i = l 

There is one substitution in the derivation of the KBH 
posterior on which we would like to focus, given in KBH as 
eqn. (5) on page 3: 



./■t(t)= Y[p Y[ 



(3) 



Equation ||3| states that the l.h.s. prior probability of 
the SNe having types r is given by the product on the 
r.h.s. involving TA-probabilities. We argue that this prod- 
uct should not be treated as the prior /t, but rather as the 
conditional /t|p- In effect, we argue that KBH should not 
use the TA-probabilities p unless P is explicitly included as 
a conditional parameter. It is to this end that we now red- 
erive the posterior on Q, taking /e|D,p(6'|(i, p) as a starting 
point, discussing at each line what has been used. 



/e 



'\d,p) 



— >■ We will first use the definiton of conditional probability 
to obtain, 

^ fe,n,p{0,d,p) 
fn,p{d,p) 



— >■ Rearranging this, and again using the definition of con- 
ditional probability, we obtain. 



■ •^"'i? , fe(e)Y,fD\e,p,Tid\0,P,T)fTip{T\p). 
D,p(d,p) ^ 



— >■ The first term on the line above is constant with re- 
spect to Q, and so is absorbed into a proportionality con- 
stant. We now make one final weak assumption: fT\piT\p) = 
YliLi fTi\Pi{Ti\Pi)- This assumption will be necessary to 
make a comparison with the KBH posterior. Making this 
assumption we arrive at. 



«/e(e)^/D|©,i.,T(die,p, 



Hp. 



Pj) 



(4) 

We will refer to the newly derived expression Q as the full 
posterior. Let us now consider the difference between the 
KBH posterior ([T]) and the full posterior, and notice that in 
the full posterior, the likelihood of the data D is conditional 
on 0,P and T, whereas in the KBH posterior D is only 
conditional on and T. This is the only difference between 
the two posteriors, and so when D\0, T is independent of P, 
the posterior Q reduces to the KBH posterior ([TJ, making 
them equivalent. This is an important result: when D\Q,T 
and P are independent, the KBH and full posteriors are the 
same. 

Our results can be summarised as follows. 



(1) As the posterior fe\D{S\d,) is not conditional on TA- 
probabilities it should be independent of TA-probabilities, 
and we thus prefer to replace the KBH posterior in |T| by 



/e|D(e|d)(x/e(e) 
E 



fn\B,T{d\o,T) n ^ n (1-^)' 
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where tt is an estimate of the global proportion of type A 
objects. 

(2) /e|£) ^>(Sjd, p) is always given by the full posterior Q. 
When D\0, T and P are independent, it reduces to the KBH 
posterior ([T]). 



It is worth discussing for SNe the statement, "£)|0,T 
and P are not independent" . One incorrect interpretation of 
this statement is, "given that we know the cosmology is Q, 
observinfj^P for a SN of unknown type adds no information 
to the estimation of the distance modulus." Indeed it is dif- 
ficult to imagine how this could be the case: we know that 
SNela are brighter that other SNe, and therefore obtain- 
ing a TA-probability close to 1 shifts the estimated distance 
modulus downwards (towards being brighter). 

A correct interpretation of the statement is, "given the 
cosmology 0, observing P of a SN of known type adds no 
information to the estimation of the distance modulus." It 
may seem necessarily true that a TA-probability contributes 
no new information if the type of the SN is already known, 
but this is not in general the case; it depends on the method 
by which TA-probabilities are obtained. 

Currently for SNe, fitted distance moduli and approxi- 
mations of TA-probabilities are frequently obtained simulta- 
neously, using for example S ALT2 ( Guy et al.|2007 1 . This in 
itself suggests that D\Q,T and P will not be independent. 
In some cases however, TA-probabilities are calculated from 
the early stages of the light curves ( [Sullivan et al.||2006[ 
Sako et al. 2008 1 while the distance modulus is estimated 



from the peak of the light curve, and so the dependence 
may be weak. As another example, in Section 4.4 of [Newl-] 
ing et al. (2011 1 TA-probabilities are obtained directly from 



a Hubble diagram. Objects lying in regions of high relative 
SNIa density are given higher TA-probabilities than objects 
lying in low relative SNIa density. As a result, at a given 
redshift, brighter nia SNe have higher TA-probabilities than 
faint nIa SNe. Similarly, at a given fitted distance modulus 
(fitted assuming type la), nIa will lie on average at lower 
redshifts than la. Both of these cases, (distance modulus | 
O, type) being correlated with P, and (redshift | 0, type) 
being correlated with P, are precisely when D\Q,T and P 
are dependent. In Section [6] a simulation illustrating this 
dependence is presented. 

For completeness, we mention that in the case of inde- 
pendent observations, that is when, 

N 

fn\0.p,T{d\e,p,T) = ]^/o.|©,p^,T^(di|6>,pi,Ti), 

i=l 

the full posterior (jlj reduces to, 

JV 

fe\o,p{d\d,p) cxYl[fD,\e,Pi,Ti{dt\e,pi,A)pi + (5) 
1=1 

fDi\e,p„T,id,\e,pi,B) (l-p,)]. 
In Section [7] we will make suggestions as to what 

^ Of course we mean "observing" in the statistical sense, that 
is obtaining the realistation of the TA-probability (p) with some 
software 
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Figure 1. Two TA-probability distributions, both with means of 
0.5. Using a threshold of 0.6, we have on left: FPR = 0.17, FNR 
= 0.45 and on right: FPR: 0.15, FNR = 0.28 



functional form may be chosen for fDi\&.Pi.Ti when using 
BEAMS for independent SNe. 



3 RATING TA-probabilities 

An object's TA-probability is the expected proportion of 
other objects with its features which are type A. In other 
words, if an object has features x, its TA-probability is the 
expected proportion of objects with features x which are 
type A. Suppose that the global distribution of P is /p. The 
expected total proportion of type A objects is then 



p(r=A)={P)= r p 

Jo 



fp{p)dp. 



(6) 



In some circumstances, it is necessary to go beyond calcu- 
lating TA-probabilities and commit to an absolute classifi- 
cation, as was the case in the SNPCC. In such cases the 
optimal strategy moving from a TA-probability to an abso- 
lute type (A or B) is to classify objects positively {A) when 
the TA-probability is above some threshold probability (c). 
The False Positive Rate (FPR) using such a strategy is 

FPR(/p) = P(P > c\T = B) 

Jl{l-p),fp{j>)dp 

J^\l-p)fp{p)dp' 
and the False Negative Rate is 



(7) 



(8) 



FNR(/p) = P(P < c|r = A) 
!oVfp{p)dp 
So Pf pip) dp 

For SNe the FPR is the proportion of nIa SNe which 
are misclassified, while the FNR is the proportion of SNela 
which are misclassified (missed). 

Intuition dictates that for classification problems, a use- 
ful fp will be one whose mass predominates around and 
1. That is, an fp which with high probability attaches 
decisivej^ TA-probabilities to observations. To minimize the 
FPR and FNR this is optimal, as illustrated in Figure [l] 

We will be presenting a simulation illustrating how the 
decisiveness of TA-probabilities affects the parameter estima- 
tion of BEAMS. To simplify our study of the effect of the 
decisiveness of TA-probabilities on BEAMS, we introduce a 

^ we say pi is more decisive than p2 if \pi — 0.5| > |p2 — 0.5|. 
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family of distributions: For each V £ [0.5, 1] we have the 
distribution 



(9) 



where S-p and Si-'p are (5-functions centered at V and l — V 
respectively. It is worth mentioning that we will be draw- 
ing probabilities from this distribution, which is potentially 
confusing. Drawing a observation of P from ([9| is equivalent 
to drawing it from {1 — VjV} with equal probability: 



P(P = p) = 



0.5 ifp: 
0.5 ifp: 



r 

I -p. 



If Pi is more decisive than P2, we say that the distri- 
bution f^^ is more decisive than f^^. 

On page 5 of KBH it is stated that the expected propor- 
tion of type A objects ^ determines the expected error in 
estimating a parameter which is independent of population 
B. Specifically, they present the result that the expected er- 
ror when estimating a parameter jj, with A'^ objects using 
BEAMS is given by, 



oc ^{P) N. 



(10) 



It should be noted that the the result from KBH ( 10 1 
is an asymptotic result in A''. For small A^, the decisiveness 
of the probabilities plays an important part. If Q were the 
only factor determining the expected error (cr^), then /°'^ 
would be equivalent to in terms of expected error. This 
would mean that perfect type knowledge does not reduce 
error, which would be surprising. An example in Section [XTj 
illustrates that decisiveness does play a role in determining 
the error. 

As mentioned on page 8 of KBH, the effect of biases 
in Tyi-probabilities on BEAMS can be catastrophic. Therein 
they consider the case where there is a uniform bias (a) of the 
TA-probabilities. That is, if observation i has a claimed ta- 
probability pi of being type A, then there is a real probability 
Pi — a that it is type A. KBH show how, by including a free 
global shift parameter, such a bias is completely removed. 
However it is not clear what to do if the form of the bias 
is unknown. For example, it could be that there is an 'over- 
confidence' bias, where to obtain the true TA-probabilities 
one needs to transform the claimed priors (p) by 



0.2 + 0.6p. 



(11) 



Introducing a bias such as the one defined by (111 will 



have no effect on the optimal FPR and FNR, provided 
the probability threshold is chosen optimally. This is be- 



cause (111 is a one-to-one biasing, and so a threshold (c) 



on biased probabilities results in exactly the same parti- 
tioning as a threshold in the unbiased space of 0.2 + 0.6c. 



However, introducing a bias such as (111 does have an ef- 



fect on BEAMS parameter estimation, as we show in Sec- 
tion |5] In Section [7] we discuss how to guarantee that the 
TA-probabilities are free of bias. 



4 EFFECTS OF DECISIVENESS AND 
SAMPLE SIZE ON BEAMS 

In this section we will perform simulations to better under- 
stand the key factors in BEAMS. The data generated will 




Figure 2. Contour plot of h{N,P). The solid lines are approxi- 
mations to lines of constant h, of the form (|13||. 



have the following cosmological analogy: Q - distance mod- 
ulus at a given redshift zq; d - the fitted distance moduli of 
SNe at zq. Furthermore, D\0, T and P will be independent, 
such that the KBH and full posterior are equivalent. 



4.1 Simulation 1: Estimating a population mean 

This simulation was performed to see how the performance 
of BEAMS is affected by the decisiveness of TA-probabilities, 
and by the size of the data set. The two populations {A and 
B) were chosen to have distributions. 



fD\T{d,r) = Normal(^T, 1), 



(12) 



where fiA = — 1 and hb = +1, as illustrated in Figure [3] 
The TA-probability distribution is chosen to be so that 
about half of the observations have a TA-probability of P, 
with the remaining observations having TA-probabilities of 
1 — P. By varying P we vary the decisiveness. 

Let us make it clear how the data for this simulation 
is generated. First, a TA-probability (p) is selected to be 
either P with probability 0.5 or 1 — P with probability 0.5, 
that is according to f^. Second, the type of the observation 
is chosen, with probability p it is chosen as A, and with 
probability 1 — p it is chosen as B. Finally, the data (d) is 



drawn from (12 1. Notice that D\T is independent of P, and 
so the KBH posterior is equivalent to the full posterior. 

In this simulation we only estimate /xa, with all other 
parameters known. We use the following Figure of Merit to 
compare the performance with different sample sizes (A'') 
and decisivenesses {P): 

h(N,P) 



{flA - ma) 



where fiA is the maximum likelihood estimate of ha 
using the KBH posterior on a sample of size N with TA- 
probabilities from f^, and {•) denotes an expectation. Values 
of h were obtained by simulation, illustrating in Figure[2]the 
performance of BEAMS for various (A'^, P) combinations. A 
good approximation to the FoM h in Figure |2] appears to be 



h(N, P)^N ( 0.32 + 1.44(P - -)' 



(13) 
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-2-10 1 2 3 

//3I7- and observations d 

Figure 3. Above are the population A (left) and population B 
(right) distributions, with (for Simulation |4.2[ l the observed val- 
ues of D drawn from these distributions shown as vortical lines 
beneath. 



although this is an ad hoc observation. One interesting ob- 
servation is that h{N,V = 1) « h{1.5N,V = 0.5) in the 
region illustrated in Figure |2] This says that given a com- 
pletely blind sample {V = 0.5), and the option to either 
double its size (A'^ — >■ 2A'') or to discover the hidden types 
{V : 0.5 — >■ 1), doubling its size will provide more informa- 
tion about ^A- It is important to reiterate that, according to 
previously mentioned result of KBH, in the limit of TV — )■ oo 
we do not expect V to play any part in determining h{N, V). 
That is, for A'^ sufficiently large, the FoM will be independent 
of -p. 

While this simulation is too simple to make extrapo- 
lations about cosmological parameter estimations from, it 
may suggest that the information contained in unconfirmed 
photometric data may be currently underestimated. 

4.2 Simulation 2: Estimating two population 
means 

The two population distributions for this simulation are the 
same as those presented in Simulation 1 and as illustrated 
in Figure [3] In this simulation, we leave both the population 
means as free parameters to be fitted for. Twenty objects are 
drawn from the types A and B, with the rA-probabilities are 
drawn from f^. The simulation is done with five difi'erent 

V values. The TA-probabilities are illustrated in Figure [Ij 
and the approximate shape of the posterior marginals of ha 
for each V value are illustrated in Figure [5] by MCMC chain 
counts. 

There are two interesting results from this simulation. 
The first is that there is negligible difference in performance 
between V — 1 and V — 0.7, so that having a 30% type un- 
certainty for all objects as opposed to absolute type knowl- 
edge does not weaken the results. The second is that as 

V approaches 0.5, BEAMS still correctly locates the pop- 
ulation means but is unsure which mean belong to which 
population. 



5 EFFECTS OF Ta-PROBABILITIES BIAS ON 
BEAMS 

In the previous section we considered the effect of the deci- 
siveness of TA-probabilities on the performance of BEAMS. 
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Figure 4. For values of V from 1 (above) to 0.5 (below), a ta- 
probability of "P or 1 — "P is attached to each observation. 




Figure 5. MCMC chain counts, approximating the posterior dis- 
tributions of fiA for the different values of decisiveness, P. 



In this section we will consider the effect of using incor- 
rect TA-probabilities. We will again be estimating fiA and 
fiB where they are —1 and 1 respectively, and the popu- 
lation variances are again both known to be 1. The true 
TA-probability distribution will be that is 



P(P = p) 



0.5 ifp = 0.8 
0.5 ifp = 0.2 



It is worth reminding the reader that we are draw- 
ing probabilities from a probability distribution, an unusual 
thing to do. To generate a TA-probability from this dis- 
tribution, one could flip a coin, and return p = 0.2 if H 
and p = 0.8 if T. We consider the effect of biasing TA- 
probabilities generated in such a manner in the following 
ways: 

) P =^ {0, 1}. Here the decisiveness of the TA-probabilities 
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of KBH, is that a flat TA-probability shift in confidence 
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Figure 6. The 99 % posterior confidence regions using the five 
biasings of the r^-probabilities, as described in Section |5] 



0.8 



1 and p = 0.2 



is overestimated, so that 
p = 0. 

(ii) P => {0.4,0.6}. Here the decisiveness of the ta- 
probabihties is underestimated, so that p = 0.8 — )■ p = 0.6 
and p = 0.2 — ^ p = 0.4. 

(iii) V => {0,0.6}. Here the TA-probabihties are underesti- 
mated by 0.2, so that p = 0.8 — >■ p = 0.6 and p — 0.2 p — 
0. 

(iv) P {0.4, 1}. Here the TA-probabihties are overestimated 
by 0.2, so that p = 0.8 p = 1 and p = 0.2 p = 0.4. 

(v) V U . Here, to each TA-probabihty a uniform random 
number from [—0.2, 0.2] is independently added. 

The 99% posterior confidence regions obtained using 
these biased TA-probabilities in a simulation of 400 points 
are illustrated in Figure ([6|. The underestimation of deci- 
siveness |(ii)| has little effect on the final confidence region, 
but overestimating the TA-probability decisiveness (i) results 
in a 6cr bias. Note that overestimating decisiveness results in 
the estimate (/IajAb) being biased towards This 
is caused by type B objects which are too confidently be- 
lieved to be type A, which pull jiA towards ^b, and type 
A objects which are too confidently believed to be type B, 
which pull pLB towards /ia- 

The contrast in effect between underestimating and 
overestimating the decisiveness of TA-probabilities is inter- 
esting, and not easy to explain. One suggestion we have 
received is to consider the cause of the observed effect as be- 
ing analogous to the increased contamination rate induced 
by overestimating the decisiveness in the case BEAMS is 
not used. With an increased contamination rate comes an 
increased bias, precisely as observed in Figure [6] It is worth 
mentioning that underestimating the decisiveness is not en- 
tirely without effect, as simulations with more pronounced 
drops in V (0.95 — > 0.55) result in noticable increases in the 
size of the 99% confidence region. 



The effect of the flat TA-probability shifts (iii) and (iv) 
introduce biases larger than 4a". This case was considered 
in KBH where, as we have already mentioned, it was shown 
that simultaneously fitting for this bias completely compen- 
sates for it. While this is a pleasing result, one would prefer 
to know that the TA-probabilities are correct, as one cannot 
be sure what form the biasing will take. 

One phenomenon which is observed in this simulation, 
as it was in simulations as summarised in Table H on page 



towards being type B (iii) does not bias the estimate of /xa 
as much as it does the estimate of /is, and vica versa. In 
other words, underestimating the probabilities that objects 
are type A will result in less biased population A parame- 
ters than overestimating the probabilities. This result may 
also be understood in light of an analogy to increased con- 
tamination versus reduced population size in the case where 
BEAMS in not used. 

Finally, we notice that in this simulation the addition of 



unbiased noise to the TA-probabilities (v) has an insignificant 
effect. This suggests that systematic biases should be the 
primary concern of future work on the estimation of TA- 
probabilities. 



6 WHEN GIVEN TYPE, THE DATA IS STILL 
DEPENDENT ON r-PRIORS 

In this section we consider for the first time a simulation in 
which the data is not drawn from /j3|t, but from fn\T,p, so 
that there is a dependence of the data on the TA-probability 
even when the type is known. The conditional pdfs are 
shown in Figure [7| To clarify the difference between this 
simulation and the previous ones, prior to this data was 
simulated as follows: 

P T\P D\T, 

where at the last step, the data was generated with a de- 
pendence only on type. Now it will be simulated as: 

P T\P D\P,T. 

More specifically, to generate data we start by drawing a 
TA-probability from j"'^, 



P(P = p) = 



0.5 
0.5 



if p : 
if p : 



0.7 
0.3. 



Note that the above distribution guarantees that P{T — 
A) = |. When the TA-probability (p) has been generated, 
we draw a type (t) from {A, B} according to 



p(r = t) 



p 

i-p 



iiT = A 

ifr^B. 



Once we have p and t, we generate d. The marginals 
/D|p,T(rf|p, t) have been chosen such that we have 



fD\Tid\A) = Normal (-1, 1) 
/o|T(diP) = Normal (1,1), 



(14) 
(15) 



as before. The marginal /£)|p.T(dj0.7, A) is composed of 
the halves of two Gaussian curves with different as, chosen 
such that the tail away from the B population is longer than 
the one towards the B population. Specifically, 



fD\p.T{d\0.7,A) 



Kexp 
K exp 



■'-§{d+ir 



ff d< 
ff d> 



where A' is a normalizing constant. The marginal 
fD\p.Tid\O.S, A) is then constructed to guarantee (14 1. The 
above construction guarantees that the population of A ob- 
jects with low TA-probabilities (0.3) lie on average closer to 
the B mean than do objects with high (0.7) TA-probabilities. 
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Figure 7. Plots of /£,|p y(d|p, r) (filled curves) for p = 0.7 (light) 
and p = 0.3 (dark), and for type A (left) and type B (right). 
Overlying are folxi'^l^) (light) and f]j^rp{d\B) (dark). 



The marginals of the B population are constructed to mirror 
exactly the A population marginals, as illustrated in Fig- 
ure [3 

To compare the use of the KBH BEAMS posterior ^ 
with the full conditional posterior (|5|, we randomly draw 40 
data points from the above distribution and construct the 
respective posterior distributions, as illustrated in Figure [8] 
Observe that the KBH posterior is significantly wider than 
the full posterior. Indeed, approximately half of the interior 
of the 80% region of the KBH posterior is ruled out to 1% 
by the full posterior. It is interesting to note that, while 
the KBH posterior is wider than the full posterior, it is not 
biased. This result goes against our intuition; we believed 
that the KBH posterior would result in estimates for ha 
and fiB which exaggerated |/iA — Ms|. Whether it is a general 
result that no bias exists when the KBH posterior is used, or 
if there can exist dependencies between P and D for which 
the use of (|T| leads to a bias, remains an open question. 

Figure Is] illustrates one realistation from the distribu- 
tion we have described, but repeated realisations show that 
on average, the variance in the maximum likelihood estima- 
tor using the KBH posterior is ~ 3 times larger than the 
variance using the modified posterior. While these simula- 
tions are too simple to draw conclusions about cosmological 
parameter estimation from, they do suggest that where cor- 
relations between TA-probabilities and distance moduli exist 
within a class of SNe, it may be worthwhile accounting for 
it by using the modified posterior. Currently it is most com- 
mon when modelling SNe for cosmology, to assume that the 
likelihood /D|e,T(rf|^, t) is a Gaussian with unknown mean 
and variance, 

D\d, P,T^ Normal(/i(6', T),a{Tf). 

If one wishes to include the TA-probabilities in the like- 
lihood, one could include a linear shift in P for the mean or 
variance. That is, 



D\e,P,T = Normal(^i(6l,r) -f- ciP,a{TY + C2P). 

Of course this is just one possibility, and one would need 
to analyse SN data to get a better idea of how P should enter 
into the above equation. 
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Figure 8. Posterior distributions on the parameters (/^a, A'b) 
using the correct posterior (|4]l (solid) and the KBH posterior ([TJ 
(dashed). The KBH posterior assumes independence between D\T 
and P. Plotted are the 80%, 95% and 99% confidence levels. The 
true parameters (orange point) lie within the 95 % confidence 
regions of both posteriors. 



7 OBTAINING UNBIASED 
r^-PROBABILITIES 

In this section we investigate likely sources of ta -probability 
biases such as those presented in Section |5j and discuss how 
to detect and remove them. For SNe, one source of ta- 
probability bias could be the failure to take into account 
the preferential confirmation of bright objects. This type of 
bias has been considered in the machine learning literature 
under the name of selection bias, and we here present the 
relevant ideas from there. We end the section with a brief 
discussion on how one could model the pdfs fD\e,p,T and 
fD{e,F,p.T, which are the likelihoods appearing in the ex- 
tended posteriors introduced in Section [2] 



7.1 Selection Bias 

With respect to classification methods, selection bias refers 
to the situation where the confirmed data is a non- 
representative sample of the unconfirmed data. A selection 
bicis is sometimes also referred to as a covariate shift al- 
though the two are defined slightly differently, as described 
in Bickel et al. (20071. With selection bias, the confirmed 
data set is first randomly selected from the full set, and 
then at a second stage it is non-randomly reduced. Such 



is the situation with a population census, where at a first 
stage, a random sample of people is selected from the full 
population, and then at a second stage, people of a certain 
disposition cooperate more readily than others, resulting in 
a biased sample of respondees. 

A form of selection bias which is well known in observa- 
tional astronomy is the Malmquist bias, whereby magnitude 
limited surveys lead to the preferential detection of intrinsi- 
cally bright (low apparent magnitude) objects. In the case 
of SN cosmology, the bias is also towards the confirming of 
bright SNe. A reason for this bias is that the telescope time 
required to accurately classify a SN is inversely proportional 
to the SN's brightness. It is therefore relatively cheap to con- 
firm bright objects and expensive to confirm faint ones. 

If the SN confirmation bias is ignored, certain infer- 
ences made about the global population of SNe are likely to 
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be inaccurate. In particular, estimates of a classifier's False 
Positive and False Negative Rates will be biased, and the 
estimated TA-probabilities will be biased in certain circum- 
stances, as we will discuss in the following section. 

7.1.1 Formalism 



The decision to confirm a SN can be dictated by different 



Following where possible the notation of Fan et al. (2005 I, in 



what follows we assume that variables {X, T, F) are drawn 
from X X T X T, where 

(i) X is the feature space, 

(ii) T — {A, B} is the binary type space, 

(iii) J- = {0, 1} is the binary confirmation space, where F = 1 
if confirmed (_F for foUowed-up) . 

A realisation {x, r, /) lies in either the test set or the training 
sets, defined respectively as; 



test set = {(x, T, /) s.t. / = 0} 
training set {{x,T,f) s.t. / - 



!}• 



For SN cosmology it could be that X, T and are respec- 
tively, 

(i) X is the space of all possible photometric data, where a 
SN's photometric data consists of apparent magnitudes and 
observational standard deviations in four colour bands over 
several nights. 

(ii) T — {la, nia}, type la and non-la SNe. 

(iii) J- = {0, 1}, where F = 1 if the SN has been spectroscopi- 
cally confirmed and thus has its type known. 

By having a training set be unbiased we mean that it is a 
representative sample of the test set, specifically that F is 
independent of both X and T. That is, the probability of 
confirmation is independent of features and type: 



P{F = 1\X ^x,T^t)= P{F = 1) 



(16) 



When the training set is unbiased, training set and test 
set objects are drawn from the same distribution over X xT. 
This distribution over X x T can be estimated from the 
training set, so directly providing an estimate of the more 
useful test set distribution. 

There are three important ways in which the indepen- 
dence relation (|16[) can break down, resulting in a biased 



training set, as described in Zadrozny (20041 and listed be- 



low. By removing bias from a training set, we mean reweight- 
ing the training points such that the training set becomes 
unbiased. 

(i) Confirmation is independent of features only when condi- 
tioned on type: F\ T and X are independent. This is the 
simplest kind of biasing, and there are methods for correct- 
ing for it ( |Bishop|[T996t , ( |Elkan|[200T| . This is not the bias 
which exists in SN data. 

(ii) Confirmation is independent of type only when given fea- 
tures: F\X and T are independent. If the decision to confirm 
is based on X and perhaps some other factors which are in- 
dependent of T, this is the bias which exists. This is probably 
the bias which exists in SN data, and there are methods for 
correcting for it, as we will discuss. 

(iii) Confirmation depends on both features and type simulta- 
neously. In this case, it is not possible to remove the bias 
from the data unless the exact form of the bias is known. 



features, examples include Sako et al. (20081; Sullivan et al 



( 2006 1, all of which are contained in the photometric data X. 



Such was the also case in the SNPCC where the probability 
of confirmation was based entirely on the peak magnitude 
in the r and i f, as we will discuss in Section [S] In reality, 
there are other factors which affect the confirmation decision 
such as the weather and telescope availability, but these are 



independent of SN type. Therefore the type (ii) bias above is 
the bias which exists in the SN data. Thus, for the remainder 
of this section we'll assume the type (ii) bias, that is 



P{F ^ 1\X ^ x,T = t) ^ P{F = 1\X 



(17) 



The assumption of the type (ii) bias can be made stronger. 
The decision to confirm an object does not in general depend 
on all of X but only a low-dimensional component (Xp) of 
it, and so we have 



P(F= 1\X = x,T - 



P{F = 1\Xf = xf) 



(18) 



where Xf is contained in X. For SNe, Xf could be the peak 
apparent magnitude in certain colour bands. 

In the following subsection we will describe how to cor- 
rectly obtain TA-probabilities under the assumption of a bias 



described by ( 18 1 



7.2 Correctly obtaining TA-probabilities 

Let us remind the reader as to how we defined TA- 
probabilities in the introduction: 



TA-probability = P{Ti = A\Xp^i = xp^i) = pi 



(19) 



where Xp^i is an observable feature of the ith object, 
extracted from Xi. Estimates of pi values can be obtained 
using several methods, of which those mentioned previously 
are|Poznanski et al.| (|2002[ ) ; [Newling et al.| pOTT] l; |Richards| 



et al. (20111; Guy et al. (20071 It is worth rementioning 



that these different methods attempt to estimate different 
probability functions, as they each condition on different 
SN features. Thus there is no sense in which one set of TA- 
probabilities estimates is the correct set. 

We now make an adjustment to definition ( |19[ ), to take 
into account that biased follow-up may result in an addi- 
tional conditional dependence on F: 

TA-probability P(T, ^ A\F, = f„Xp,, = xp,,) = P,. 

(20) 

The most informative TA-probabilities one could use 
would be those conditional on all of the features at one's 
disposal, 



Xf 



X 



Pi 



P{T ^ A\F = 0,X 



(21) 



However, when X is a. high-dimensional non- 
homogeneous space, as is the case with photometric 



SN data, it can be difficult to approximate (21 1 accurately. 
It is for this reason that it is necessary to reduce the 
features to a lower dimensional quantity Xp £ Xp, so 
that the TA-probabilities are calculated from a subspace 
(Xp) of the full feature space, as described by (201. The 



subspace Xp should be chosen to retain as much type 
specific information as possible while being of a sufficiently 
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low dimension. In the SNPCC Newling et al. (20111 chose 



— >Then using ( 25 I , we have 



Xp to be a 20-dimensional space of parameters obtained by 
fitting lightcurves. 

The job of obtaining estimated r^-probabilities for test 
set objects {F — 0) is one of obtaining an estimate of the 
type probability mass function, 



/t 



(22) 



Again, for ( 22 1 we prefer not to use the standard mass func- 



tion notation, in order to to neaten certain integrals which 
follow. The TA-probability of a test set object can now be 
expressed in the following way, 

P(r = A\F = 0,Xp = xp) = /t|f.Xp(A!0,xp). 

Using kernel density estimation, boosting, or any other 
method of approximating a probability function, one can 
construct an approximation (/) of the type probability func- 
tion for training set objects, 



(23) 



Using the estimate / in ( 23 1 one can estimate the ta 



probabilities for the training set objects: 

P(r = A\F = 1, Xp = xp) ^ f{xp). 



(24) 



The estimate ( 24 1 is not directly important as the train- 



ing set object types are known exactly. But it is only through 
the training set objects that we can learn anything about the 
types of the test set objects. 

How / from the training set is related to fT\F=o,Xp i 22 1 



depends on the relationship between Xp (the data which de- 
termines confirmation probability) and Xp (the data used to 
calculate TA-probabilities). There are two cases to consider. 
The first, which we write as Xp C Xp, is when the data 
which determines confirmation probabilities is completely 
contained in the data used to calculate TA-probabilities. 
That is, 



Xp C Xp 



dcf 



P{F = l\Xp = xp) = P(_F = l\Xp = xp) 



The second case, when Xp (f. Xp is when not all confir- 
mation information is contained in Xp, 

Xp ^ Xp P(F = l\Xp = xp) / P(F = l\Xp = xp) 

In the case of Xp C Xp, it can be shown that, 

P(F = 1|T = r, Xp = xp) = P{F = l\Xp = xp). (25) 



7.2.1 Xp C Xp 

We will show that in the case of Xp C Xp, a type probability 
function approximating the training population (/) is an 
unbiased approximation for the type probability function of 
the test population {F — 0). To show this we start with the 
type probability of a test object: 



P(r = r\F = 0,Xf 



Xp 



Using Bayes' Theorem, we have 

_ P(f = Ojr ^r,Xp^ xp) ■ P(r = t|Xp = xp) 
^ P(F = 0|Xp = xp) 



P{F = 0\Xp = xp) ■ P(T = r|Xp = xp) 

P{F = 0\Xf = xp) 
P(r = r|Xp = xp). 



(26) 



— >■ Using the same steps as above but in reverse and with 
F = 1, we arrive at 

= P(r = riF= l,Xp =a;p). 

— >■ This is the type probability function for training set ob- 
jects, and it can be approximated: 



(27) 



This is a useful result, as it says that / is not only 
an approximation of the type probability function of the 
training data, but also of the test set. Thus, / should provide 
unbiased TA-probabilities for the test set when Xp C Xp. 

It should be noted that for / to be a good approximation 
for the test set, it is necessary that the training set covers all 
regions of Xp where there are test points. That is, if there 
are values of xp for which P(Xp = a::p|F=l)=0 and 
P(Xp = xp\F = 0) ^ 0, then the approximation / will not 
converge to fT\p=o,Xp as the training set size grows. One 



( 2005 1 for a full treatment of this 



can refer to Fan et al. 
topic. 

With respect to SNe, the requirement of the preceding 
paragraph is that, if a SN is too faint to be confirmed and to 
enter the training set, it should not enter the test set either. 
We will return to this point again in Section [S] 

One important question which we do not attempt to 
answer here is, how many SNe of different apparent mag- 
nitudes should be confirmed to obtain as rapid as possible 
convergence of / to /t|f=o,Xp- An interesting method for 
deciding which SNe to confirm may be one based on the 
real-time approach proposed in Preund et al. (19971, where 



the decision to add an object to the training set is based on 
the uncertainty of its type using the currently fitted /. In 
Section |8] we discuss this further. 



7.2.2 Xp (f. Xp 

If Xp (f. Xp we will not be able to use / to estimate the TA- 
probabilities in the test set, as ( |27[ ) required Xp C Xp. In 
addition to this problem of not being able to use / to obtain 
unbiased TA-probabilities for the test set objects, \iXp Xp 
then 

p(r = TjXp = xp) / P(r|Xp = xp, Xp = xp). 

This tells us that there is additional type information to be 
obtained from Xp, and so by not including Xp one is wast- 
ing type information. For this reason we recommend recon- 
structing the TA-probabilities based on redefined features, 
Xp <— {Xp,Xp). 

However, it is possible that one explicitly does not want 
to use Xp in calculating TA-probabilities. This may be the 
case if one wishes to reduce the dependence between D and 
P, as presented in Section [2] For SNe, this may involve ob- 
taining TA-probabilities from shape alone, independent of 
magnitude, so that Xp is a space whose dimensions describe 
only shape and not magnitude. In this case, as we cannot 
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use /, we need to use the relationship derived in Shimodaira 



(20001 



p(r = T|F = o,Xp) 

= / fT,Xp\F,Xp{r,XF\l,Xp) ■ ■w{xF,Xp)dXF, (28) 

where the weight function is defined as 

. . fF\Xp{0\xF)fF\Xp{l\xp) 

w{xF,xp) = - 7?,ur^- 29 

Jf\Xf(Mxf)Jf\Xp(0\xp) 



Notice that if Xp C Xp, then w{xf,xp) — 1 and so ( 28 1 
reduces to the type probability function for training set 
objects, approximated by / as expected from ( |27[ ). When 
w{xF,xp) 7^ 1, the training set type probability function 
/ cannot be used directly as an approximation to the test 
set type probability function. However, if each training set 



object is weighted using ( 28 1, then an unbiased test set type 



probability function approximation can be obtained. 

The weight function ( |29[ ) does not require any type in- 
formation and so can be estimated as a first step. This ad- 
ditional step of estimation introduces additional error into 



the final estimate of (22 1, a theoretical analysis of which is 



presented in Cortes et al. ( 2008 1 . An alternative to the two- 



stage approach would be to fit the two terms in ( 28 1 simulta- 



neously, as suggested and described by ( Bickel et al. 



The use of 1 29 1 was first suggested in Shimodaira (20001, 



20071 



where a detailed analysis of the asymptotic behaviour of its 



approximation is given. Therein, it is suggested that ( 29 1 be 
approximated by kernel density estimation. 

In the case where F and Xp are independent, the weight 
function reduces to one of only Xp, 



w(Xf — Xp) 



P{F = 0\Xp = xp)P{F = 1) 



(30) 



P(F = 1\Xf = xf)P{F = 0)' 

This reduction in dimension may be valuable in approx- 
imating the weight function. 

7.3 Detecting and removing biases in 
TA-probabilities 

In the previous section we presented the correct way in which 
to estimate TA-probabilities in the case Xp Xp. In this 
section we will present an example illustrating this process, 
but in the context of bias removal. 

Suppose that we have a program which outputs scalar 
values (p) , which are purported TA-probabilities. We believe 
that the output values have some unspecified bias, which 
we wish to remove. An assumption we make is that the p 
values are calculated in the same way for training and test 
sets. That is that the program does not process cases = 
and F = \ differently. It may seem strange to be interested 
in what the program does when F = 1, but as already men- 
tioned it is only from the training set that we can learn 
anything about the test set. The idea now is to treat the 
received p values as the xp% from the previous section, and 
not directly as TA-probabilities. 

For this example, we choose Xp = [0, 1]. To now 
transform a test set value p £ [0, 1] into an unbi- 
ased TA-probability using ( |28| l, one needs to estimate cer- 
tain probability functions using kernel density estima- 
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Figure 9. Realisations of a training set (left) containing type 
A (red pluses) and type B (blue points) objects, and a test set 
(right), drawn according to l |31| . Overlaid are faint lines delineat- 
ing the discrete regions described by | |31| 



tion. The necessary functions we see from ( 28 1 and 
(29| are fT.Xp\F,Xp(r,XF\l,p), fF\Xp{'^,XF), fp\xp{0,p), 
fF\Xp{0,XF) and /f|Xp (0, sp). 

It is an interesting and important question as to how 
accurately these probability functions can be approximated 
with few data points, but for this example we assume them 
known, 



fT,Xp\F,XpiA,XF\l,p) = I 

fF\Xp{0,XF) = (1 - xp), 
fF\Xpi'^,XF) = Xp, 

fplXpim = fFlXpiOm ^ 



if ^Xp < p < 1 



Xp 

2- Xp if p > 1 
if p < lx%. 



2 J-F, 



(31) 



Realisations from the above distribution are illustrated 
in Figurejg] By integrating xp out of fT,Xp\p,Xp{A,xp\l,p) 
in ( |31[ ), we have that 

P(T = yl|F = l,Xp =p) =p. (32) 

That is, in the training set p is an unbiased estimate of a 
TA-probability. The TA-probabilities for objects in the test 



set we estimate using ( 28 1 



P(T = A\F = 0,Xp = p) 

= / fT,Xp\p,Xp{'r,xp\Q,xp)dxp 

J Xp 



Xp 



T,Xp\P,X, 



, (t, a;p|l, Xp) w{xp,p) dxi 



.fT,Xp\p,Xp{r, a;p|l,a;p; 



1 — a;p 



dxr. 



Xp 



\/2p-p ifp<0.5, 
2 - p - V2 - 2p if 0.5 < p. 



(33) 



The TA-probabilities ( 32 1 and ( 33 1 are plotted in 



Figure |10[ where we see that p provided accurate TA- 
probabilities for the training set, but not for the test set. 
This is not unexpected in reality, where the program pro- 
viding the TA-probabilities may have been trained only on 
the biased training data. It is important to remember that 
this bias should only arise when Xp Xp. 
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Figure 10. Corrected r^-probabilities. The disproportionately 
large number of training SNe with decisive TA-probabilities (as 
depicted in Figure [9]l , causes p values to be too confident as test 
set TA-probability estimates. 



8 SUPERNOVA SURVEYS AND THE SNPCC 

The SNPCC provided a simulated spectroscopic training 
data set of approximately 1000 known SNe. The chaUenge 
was then to predict the types of approximately 20 000 other 
object^ from their lightcurves alone. Since the end of the 
competition, the types of all the simulated SNe have been 
released, making a post competition autopsy relatively easy 



to perform. In the results paper Kessler et al. ( 2010 1 we see 



that the probability that a SN was confirmed was based on 
the r-band and i-band quantities. 



band 



band 
dcf '^ncak 
X — f r- 



?UTband 
min 



where rrip^^k is the band-specific apparent magnitude of 
a SN, and M^^'^ and m^^^ are constants. In 
(20101 it is given that for col — r and col = i, 



Kessler et al. 



eo 1 



dcf ?Ttp^ak - 16.0 

5.5 

dcf rrip^^y. — 21.5 
2X) 



(34) 



where eo is some constant. Once Egpcc and tspoc have been 
calculated, if a [0 — >■ 1] uniform random number is less than 
either of them, confirmation is performed. As confirmation 



depends only on tspcc and tspcc, we have from (261 that 
P(r = r|F = 0,m^,,k,' 



ipcak) = = t\F = l,mp;,ak,'Tlpeak) 

(35) 



Equation ( 35 1 can be interpreted as saying that the ratio 
lamla is the same in a given Jripeaki "^pcak bii^- The man- 
ner in which the follow-up was simulated should of course 



guarantee that (35 I holds. In theory one should be able to 
deduce the verity of ( |35[ ) from Figure |11[ but the redshift 
bins with large numbers of confirmed SNe are too sparsely 
populated by unconfirmed SNe to check that the Ia:nla is 
invariant. To be in a position where ( |35[ ) can be checked is 
in general an unrealistic luxury, as without the types of the 
test objects this is impossible. 

In terms of obtaining accurate TA-probabilities, a dis- 
turbing feature of Figure [TT] is the absence of training SNe 
with high apparent magnitudes. With no training SNe with 

^ These lightcurves are available at http://sdssdp62.fnal.gov/ 
sdsssn/SIMGEN_PUBLIC / 
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Figure 11. Counts of confirmed (left) and not confirmed (right) 
SNe, la (dashed) and non-la (solid) as a function of rn^^^^ (above) 
S'li'i ™peak (below). 



i-band apparent magnitudes greater than 23.5, we cannot in- 
fer the types of test SNe with apparent magnitudes greater 
than 23.5. Indeed there would be no non-astrophysical rea- 
son not to believe that all SNe with apparent magnitudes 
greater than 23.5 are non-la. As already mentioned in Sec- 
tion [TA] in situations where the training set does not span 
the test set, one should ignore unrepresented test objects 
from all analyses. All test SNe other than those for which 
there are training SNe of comparable peak apparent magni- 
tudes in r and i bands should be removed from a BEAMS 
analysis, unless there is a valid astrophysical reason not to 
do so. This entails ignoring about 95% of unconfirmed SNe; 
an enormous cut. We therefore consider it important to con- 
firm more faint SNe. 



In Newling et al. (2011 1, a comparison is made between 



training a boosting algorithm on the non-representative 
spectroscopically confirmed SNe and a representative sam- 
ple, randomly selected from the unconfirmed SN set. 
Therein, the authors use twenty fitted lightcurve parame- 
ters, including fitted apparent magnitudes in r and i bands. 
This corresponds to the situation discussed in Section [7. 2. 1[ 
where Xp C Xf. For this reason, the probability density 
function / in 



24 



as estimated by their boosting algorithm 
should be an unbiased estimate for fT\F=o,Xp- But being 
unbiased does not guarantee low error, and when trained on 
the confirmed SNe, regions of parameter space correspond- 
ing to high apparent magnitude had no training SNe with 
which to learn, and so the approximation of |22] was poor. 
However when trained on the representative set, every re- 
gion of populated parameter space was represented by the 
training set, and the approximation of [22] was greatly im- 
proved. 



In their paper, Richards et al. (2011 1 describe their en- 



try in the SNPCC, and they report how a semi-supervised 
learning algorithm performs better with a few faint training 
SNe than with many bright ones. The comparison was per- 
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formed while keeping the total confirmation time constant. 
Thus their conclusion was the same as ours; that it is im- 
portant to obtain a more representative SN training sample. 



9 CONCLUSIONS AND 
RECOMMENDATIONS 

In this paper we discussed BEAMS, and extended the KBIf 
posterior probability function to the case when D\T (dis- 
tance modulus I type) and P (type probability) are depen- 
dent. In Section [6] we considered an example where the de- 
pendence between D\T and P is strong, and observed a large 
reduction in the posterior width using the extended poste- 
rior as opposed to the KBH posterior. No bias is observed 
when using either the extended or the KBH posterior. 

In Section [4] we considered examples where the KBH 
posterior is valid, that is when D\T and P are independent. 
We performed tests to ascertain the importance to BEAMS 
of i) the decisiveness of the TA-probabilities (observations 
of P), and ii) sample size. In one test (4.1 1, we observed 
how doubling a sample size reduces error in parameter es- 
timation more than obtaining the true type identity of the 



objects does. In another test (4.2 1, we observed how BEAMS 



accurately locates two population means, but fails to match 
each mean to its population. 

We looked at the effects of using biased TA-probabilities 
in Section ([5|. The result of KBH, that TA-probability bi- 
ases towards population A affect the population's parame- 
ter estimates less than biases in favour of population B, was 
observed. A similar result which is uncovered is that biases 
towards high decisiveness are more damaging than biases 
towards low decisiveness. In other words, it is better to be 
conservative in your prior type beliefs than too confident. 

Our recommendations for BEAMS may thus be sum- 
marised as follows. Firstly, the inclusion in the likeli- 
hood function of TA-probabilitiescan dramatically reduce the 
width of the final posterior, providing tighter constraints 
on cosmological parameters. Secondly, conservative estima- 
tion of TA-probabilities is less harmful than too decisive an 
estimation. Thirdly, it is possible to remove biases in ta- 
probabilities using the techniques described in Section 

In Section [7] we considered the problem of debiasing 
TA-probabilities. Interpreting recent results from the ma- 
chine learning literature in terms of SN cosmology, we dis- 
cussed the different ways in which training sets can be bi- 
ased and how to remove such biases. The key to under- 
standing and correcting biases is the relationship between 
Xf and Xp, where Xf are object features which determine 
confirmation probability, and Xp are those features which 
determine TA-probabilities. In brief, when Xp contains Xp, 
TA-probabilities should be unbiased, but if this is not the 
case, there are sometimes ways for correcting the bias. 

With respect to future SN surveys, we emphasize the 
importance of an accurate record as to what information is 
used when deciding whether or not a SN is confirmed. Us- 
ing this information, one should in theory be able to remove 
all the affects of selection bias when Xp (f. Xp. In other 
words, using all the variables which are considered in de- 
ciding whether to follow-up a SN, it will always be possible 
to obtain unbiased TA-probabilities, irrespective of what the 
TA-probabilities are based on. Such follow-up variables may 



include early segments of light curves, x goodness of fits, fit 
probabilities, host galaxy position and type, expected peak 
apparent magnitude in certain filters, etc. 

Our second recommendation for SN surveys is that 
more faint objects are confirmed. While it not necessary for 
most machine learning algorithms to have a spectroscopic 
training set which is exactly representative of the photo- 
metric test set, it is necessary that the spectroscopic set at 
least covers the photometric set. Thus having large numbers 
of faint unconfirmed objects without any confirmed faint ob- 
jects is suboptimal. 
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APPENDIX A: POSTERIOR TYPE 
PROBABILITIES 

We here derive the posterior type probabilities based on the 
modifications of Section [2] The posterior type probability 
will be derived, conditional on D and P. This derivation can 
be easily extended to posterior type probabilities conditional 
on D, F and P. 



= / M\e,n.p{A\e,d,p)feiD,F{e\d,p)d9 
Je 

= / M\e,D,,pAA^'di,pi)fQin,p{d\d,p)d 



we have assumed that the objects are independent, 

fD,\e,p.,T,{d4e,p,,A)fT,ie,p,{A\e,p,) 



fD,\Q,pSM^^Pi) 



X fe\n,pWd,P)de 



we have used B ayes' Theorem, 
A, 



A, + Bi 



fein,pid\d,p) de. 



(Al) 



where Ai = P{di\9,pi,Ti = A)p^, Bi = P{di\e,pi, 
Ti — _B)(1 — Pi), and we have assumed used that 
fTi\e,Pi{A-\9,Pi) =Pi. 



If the posterior fe\n,p confines S to a region sufficiently 
small such that Ai and Bi are approximately constant, then 
the posterior type probability ( |A1| | is well approximated by 

Ai{§)/ (^Ai{9) + Bi{9)^ where 9 is the maximum likelihood 
estimator of fe\n,p{d\d,p). Furthermore, the posterior odds 
ratio, 



dcf fT,\n,p{A\d,p) 

posterior odds ratio — — ^ ^ 

fTi\n,p[B\d,p) 
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can be shown to be given by the prior odds ratio multiplied 
by the Bayes Factor, 



posterior odds ratio ; 



Pi 



1 - Pi 



fDi\e.p,,T,{dz\e,pi,B) 



APPENDIX B: ADDITIONAL CONDITIONING 
ON THE CONFIRMATION OF SUPERNOVA 
TYPE 

In this paper we did not distinguished between the contri- 
butions of unconfirmed and confirmed objects to the pos- 
terior. While we can calculate approximate TA-probabilities 
for confirmed objects, these values should not enter the pos- 
terior, but be replaced by (if type B) or 1 (if type A). Let 
us introduce the random variable F to denote whether an 
object is confirmed, so that _F = 1 if confirmed and _F = if 
unconfirmed. With this introduced, we wish to replace the 
TA-probabilities p by p, where, 

[ P. if.A = 0, 
Pi = < 1 if /i = 1 and n = A, 
[ if /i = 1 and n = B. 

We must be careful to let the new information which we 
introduce in p be absorbed elsewhere in the posterior. To 
this end, as we did in Section ?? we start afresh the pos- 
terior derivation, explicitly including the vector (/) which 
describes which objects have been foUowed-up. Doing this, 
we arrive at the following posterior distribution 

fe\D,F,p{e\d,f,p)(xf@{0) X (Bl) 

fD\0.F,P,T{d\O,f,P,T) l[p^l[ (1-p,). 

The new information (/) has been absorbed into the 
likelihood, fu\ - - For a particular application, one may now 
ask if the addition of in is necessary. We have al- 

ready mentioned that for SNe D\0, T is unlikely to be inde- 
pendent of P. It is also unlikely that D\9,T is independent 
of F, as bright SNe, which have lower fitted distance moduli 
at a given redshift, are confirmed more regularly than faint 
ones. However, it is possible that by additionally condition- 
ing D on P this confirmation dependence is broken, so that 
D\9, P, T and F are independent. We leave this as an open 
question. 



In the case of independent SNe, the posterior (Bll re- 
duces to 

N 

f@\n,F,p{d\d,f,p) oc Y\ [fDi\e,Fi,Pi,Ti {di\e, fi,p„ A)pi + 

fDi\e,Fi,Pi,Ti{di\e,f„Pi,B) (1 - Pi)]. 

(B2) 
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